(57) Abstract: Mass data are typically not unique, i.e., each experimentally determined mass can randomly match one or several
molecules in a database. Random matching between mass data and molecules in a database can cause false identification results. In
order to minimize false results, random matching must be appropriately accounted for in a method for molecule identification. The
invention provides a method to determine, for any molecule in a database and for any experimental and database search constraints,
the probability that a particular number of matches between the mass data and masses of molecule constituents results from random
matching. The method utilizes the determined probability for random matching to assign scores and rank molecules in a database.
The invention further provides a method of generating a frequency function of scores for any experimental condition or database 
search constraints, wherein the scores relate to random identifications of molecules. Frequency functions are necessary and sufficient 
tools for testing the significance of a score associated with an identification of an unknown biological molecule. 
SYSTEM FOR MOLECULE IDENTIFICATION

Field of the Invention 

The present invention relates to a method and tools for the identification of
unknown molecules, and, particularly, a method and tools for molecule
identification that provide a solution to the problem of random mass matching.

Background of the Invention
Identification of a molecule or several molecules in a sample is a technical
problem in various fields of research and technology. Molecule identification
problems can concern, e.g., the tracing of unwanted substances in the environment
and the studies of metabolic pathways and disease-state markers in drug
development projects. Molecule identification problems can sometimes be solved
by the appropriate application of instruments and methods for the acquisition
and processing of data from a sample containing the molecules to be identified.
One example of data from a sample is mass data. Molecular or molecular
constituent mass data can be obtained by a variety of techniques, including
ultracentrifugation, electrophoresis, and mass spectrometry.
Experimental mass data from the sample analyzed are often compared with
database information about known or hypothetical molecules.

In particular, mass spectrometry (MS) combined with database searching
has proven to be a useful approach for molecule identification. For example, MS
of protein digests combined with searching in protein and DNA sequence
databases is a method of choice for the identification of proteins in proteomics
projects. The field of proteomics, which includes the elucidation of protein function
under various cell conditions, is believed to form a future basis for drug design.
MS-protein identification involves cleavage of proteins with an enzyme having
high digestion specificity (usually trypsin), whereupon the resulting proteolytic
products are subjected to mass analysis by either matrix-assisted laser
desorption/ionization mass spectrometry (MALDI-MS) or electrospray ionization
mass spectrometry (ESI-MS). The experimentally determined masses are then
compared with masses of peptides that individual proteins in a database would
yield if they were cleaved by the same enzyme as was used in the experiment. In
some experiments, individual proteolytic peptide ions are isolated and subjected
to fragmentation and fragment mass analysis in the mass spectrometer. The
resulting fragment masses are then compared with hypothetical proteolytic
peptide fragment masses of the proteins in a database. The protein is identified
based on an evaluation of either or both of these comparisons.
Mass spectrometry determines a peptide mass $m_i$ to an accuracy $\pm\Delta m_i$, with
$\Delta m_i/m_i$ typically >30 ppm. Within the mass range $m_i \pm \Delta m_i$, proteolytic peptide
masses of several proteins in a genome database can match. Hence, an
unmodified peptide will match randomly with several proteins in the database, in
addition to the true match with the actual protein present in the sample, and a
modified peptide will yield only random matches. Consequently, a database
search using mass spectrometry information will not always identify a protein
unambiguously. Therefore, in order to perform accurate and reliable molecule
identification, instruments for obtaining mass data must be appropriately linked
with the use of other technical resources for the comparison of mass data and
mass information obtained from a database. The link can be a system that makes
use of a method including means for comparison of data and database
information, preferably operated via a computer.
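
The comparison of measured masses with database masses under a relative mass accuracy can be sketched as follows. This is only an illustration, not the system described here: the function name `count_matches`, the default tolerance of 30 ppm, and the rule that a measured mass counts as matched when at least one database mass lies within the tolerance window are all assumptions.

```python
from bisect import bisect_left, bisect_right

def count_matches(measured, theoretical, ppm=30.0):
    """Count measured masses with at least one theoretical mass within +/- ppm."""
    theo = sorted(theoretical)
    hits = 0
    for m in measured:
        tol = m * ppm * 1e-6              # absolute tolerance for this mass
        lo = bisect_left(theo, m - tol)   # first theoretical mass >= m - tol
        hi = bisect_right(theo, m + tol)  # one past last theoretical mass <= m + tol
        if hi > lo:
            hits += 1
    return hits

measured = [1000.00, 1500.00, 2000.00]
theoretical = [1000.02, 1500.10, 2500.0]
print(count_matches(measured, theoretical, ppm=30.0))
print(count_matches(measured, theoretical, ppm=100.0))
```

Widening the tolerance increases the number of matches, which is exactly why poorer mass accuracy produces more random matches.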

Despite the rapidly increasing impact of mass spectrometric protein
identification on proteomic research, the problem of accurately taking the
phenomenon of random mass matching into account in a database search system
has been overlooked. As increasingly complex processes are explored by MS-based
protein identification, the use of optimized procedures will become critical.
An optimized protein identification system cannot be designed without an
appropriate account of the random mass-matching process.

State of the Art 

Identification of proteins by the above-described approach requires a scheme
for determining the best match between the experimental data and a sequence in
the database. Existing schemes for determining the best match include ranking
by number of matches (W.J. Henzel et al., Proc. Natl. Acad. Sci. U S A 90, 5011,
1993), a scoring system based on the observed frequency of peptides from all
proteins in a database in a given molecular weight range (the so-called "MOWSE
score"; D.J.C. Pappin et al., Current Biology 6, 327, 1993), and a scheme based on
Bayesian probabilities (W. Zhang et al., Anal. Chem. 72, 2482, 2000).

None of these schemes takes the problem of random mass matching
appropriately into account. The lack of an appropriate account of random
mass matching hinders optimum performance of molecule identification
procedures, since random mass matching can cause false identification
results, especially when the quality of the mass spectrometry data is poor.
Summary of the Invention
The object of the present invention is to overcome the shortcomings of the
above-mentioned schemes, i.e., to provide a method that solves the problem of
random mass matching.
This and other objects have been met by providing a system including
methods of determining the probability for a particular score due to random mass
matching of a molecule, and of utilizing the computed probability to rank
molecules. The method comprises: a) determining the number of matches
between a database molecule and mass data; b) computing the probability that a
database molecule would yield a particular number of matches by chance; c)
computing a score based on one or several probabilities computed in step (b); d)
comparing the scores of molecules in a molecule database; and e) identifying the
molecule or molecules that yield(s) the best score(s).

The invention further provides a method of generating a frequency function
of the number of matches for random (false) molecule identification for any
experimental condition. The method comprises: a) defining a sub-population of
the molecules contained in a database; b) computing the probability that a
molecule in this sub-population would yield a particular number of matches by
chance; c) computing a probability that all molecules in the sub-population would
yield at most a particular number of matches by chance; d) computing the
probability that at least one molecule in the sub-population would yield at least a
particular number of matches by chance; and e) determining the relative
frequency of each number of matches by using the probability computed in step
(d) for each number of matches and generating therefrom a frequency function of
the number of matches for random protein identification.
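
Steps (b) to (e) can be sketched in code as follows. This is a hedged illustration under stated assumptions, not the expression used by the invention: it assumes that the molecules of the sub-population match independently, and it interprets the relative frequency in step (e) as the probability that the best random match count in the sub-population is exactly k. The function name `frequency_function` and the input layout (one list of probabilities per molecule) are hypothetical.

```python
def frequency_function(pmfs):
    """pmfs[j][k] = probability that molecule j yields k random matches (step b).

    Returns f[k], read here as P(best match count in the sub-population == k).
    Assumes independent molecules; this reading of step (e) is an assumption.
    """
    n = len(pmfs[0]) - 1
    # Step (c): probability that ALL molecules yield at most k matches.
    all_at_most = []
    for k in range(n + 1):
        prod = 1.0
        for p in pmfs:
            prod *= sum(p[: k + 1])
        all_at_most.append(prod)
    # Step (d): probability that AT LEAST ONE molecule yields at least k matches.
    any_at_least = [1.0] + [1.0 - all_at_most[k - 1] for k in range(1, n + 1)]
    # Step (e), as assumed: difference of consecutive step (d) probabilities.
    return [
        any_at_least[k] - (any_at_least[k + 1] if k < n else 0.0)
        for k in range(n + 1)
    ]

# Two molecules, each matching 0 or 1 mass with probability 0.5.
print(frequency_function([[0.5, 0.5], [0.5, 0.5]]))
```

Because every quantity is a closed-form product or sum, no simulation of random identifications is needed.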

Brief Description of the Drawings 
Fig. 1 shows frequencies (i.e., number of matching proteins) of various 
tryptic peptide masses in a database. 
Fig. 2 shows mass distribution peaks for tryptic peptides.

Fig. 3 shows the performance of an implementation of one embodiment of
the invention in comparison with state of the art systems for protein
identification. The graph displays results from simulations employing the
invention (denoted Probity), a Bayesian method, and a method based on the
number of matches.
Fig. 4 shows score frequency functions generated by the invention in
comparison with score frequency functions generated by simulation.

Detailed Description of the Invention 
Many applications of molecule identification are inherently large-scale.
Examples of large-scale molecule identification can be found in proteomics
projects, where thousands of proteins from cells are to be identified, or cells are
screened for molecular markers of states of disease. The ultimate goal of molecule
identification procedures is to rely on simple, rapid and automated procedures
and instrumentation. The technical solutions of the system that links and
compares mass data with database information are of key importance to the
design of instruments for automated molecule identification, since the system
used will strongly influence the capability of obtaining a high relative frequency
of true identification results, which is particularly critical when the quality of the
data is poor. Furthermore, automated identification instrumentation demands
that the quality of identification results be assessed automatically by the use of a
significance test (J. Eriksson et al., Anal. Chem. 72, 999, 2000). However, a
reliable automated protein identification system cannot be designed without an
appropriate account of the random mass-matching process.

One object of the present invention is to provide a system that utilizes
methods that allow more accurate molecule identification and more accurate and
rapid significance testing of identification results. The method according to the
invention appropriately takes into account the phenomenon of random matching,
and is therefore well suited for implementation in an automated molecule
identification system.

A particular concern regarding large-scale molecular identification is the
time required to obtain the identification result together with a quality
assessment of this result. A quality assessment can be accomplished by a
significance test, which requires knowledge of frequency functions describing
scores for false results. Such frequency functions are currently obtained by simulation of
random molecular identification. However, since the time needed to derive a
frequency function by simulation is about 1000 times longer than with the use of
the invention, there is a need to derive such a frequency function from an
analytical expression. In one embodiment of the invention, such an analytical
expression for the derivation of a frequency function is provided.
The methods according to the invention are well suited for, but not limited
to, applications in which the molecules are biological molecules that can exist in
cells of organisms.

Biological molecules include any biological polymer that can be degraded
into constituent parts. The degradation is preferably into constituent parts at
predictable positions to form predictable masses. Examples of biological
molecules include proteins, nucleic acid molecules, polysaccharides and
carbohydrates.

An experimental biological molecule is a biological molecule that is to be
identified; the experimental biological molecule can also be referred to as an
unknown biological molecule. A theoretical biological molecule is a known
biological molecule described in a database.

Proteins are polymers of amino acids. Constituent parts of proteins comprise
amino acids. A protein typically contains approximately at least ten amino acids,
preferably at least 50 amino acids and more preferably at least 100 amino acids.

Nucleic acids are polymers of nucleotides. Constituent parts of nucleic acids
comprise nucleotides. Typically, a nucleic acid contains at least 100 nucleotides,
preferably at least 500 nucleotides.

Polysaccharides are polymers of monosaccharides. Constituent parts of
polysaccharides comprise one or more monosaccharides. Typically, a
polysaccharide contains at least five monosaccharides, preferably at least ten
monosaccharides.

Mass data of biological molecules are quantifiable information about the 
masses of the constituent parts of the biological molecule. Mass data include 
individual mass spectra and groups of mass spectra. The mass spectra can be in 
the form of peptide maps, oligonucleotide maps or oligosaccharide maps.

The method of the present invention includes generating experimental mass
data for the experimental molecule within a certain mass range. Mass data
include the measured masses. The method also includes generating theoretical
mass data in the same mass range. In one embodiment, the experimental mass
data is a subset of the experimental mass data.

For example, mass data for molecules can be generated in any manner that
provides mass data within a certain accuracy. Examples include matrix-assisted
laser desorption/ionization mass spectrometry, electrospray ionization mass
spectrometry, chromatography and electrophoresis. Mass data can also be
generated by a general-purpose computer configured by software or otherwise.
For the purposes of the present invention the mass data, for example a
peptide mass, $m_i$, is determined to an accuracy $\pm\Delta m_i$, with $\Delta m_i/m_i$ preferably
<10,000 ppm, more preferably <100 ppm, and most preferably <30 ppm.

A step in generating mass data of a molecule may include first cleaving the
molecule into constituent parts. Biological molecules may be cleaved by methods
known in the art. Preferably, the biological molecules are cleaved into constituent
parts at predictable positions to form predictable masses. Methods of cleaving
include chemical degradation of the biological molecules. Biological molecules
may be degraded by contacting the biological molecule with any chemical
substance.

For example, proteins may be predictably degraded into peptides by means
of cyanogen bromide and enzymes, such as trypsin, endoproteinase Asp-N, V8
protease, endoproteinase Arg-C, etc. Nucleic acids may be predictably degraded
into constituent parts by means of restriction endonucleases, such as Eco RI, Sma
I, BamH I, Hinc II, etc. Polysaccharides may be degraded into constituent parts
by means of enzymes, such as maltase, amylase, alpha-mannosidase, etc.

In the present invention a mass range ($m_{min}$, $m_{max}$) is determined for the
experimental mass data. The mass range can be any mass range of the mass
data. In one embodiment, the mass range is bounded by the minimum and maximum
measured masses of the experimental mass data for a molecule.

A molecule database is any compilation of information about characteristics
of molecules. A molecule database can be a biological molecule database.
Databases are the preferred method for storing both polypeptide amino acid
sequences and the nucleic acid sequences that code for these polypeptides. The
databases come in a variety of different types that have advantages and
disadvantages when viewed as the hypothesis for a polypeptide identification
experiment.

While the "database entry" for an amino acid sequence may appear to be a
simple text file for a user browsing for a particular polypeptide, many databases
are organized into very flexible, complicated structures. The detailed
implementation of the database on a particular system may be based on a
collection of simple text files (a "flat-file" database), a collection of tables (a
"relational" database), or it may be organized around concepts that stem from the
idea of a protein, gene, or organism (an "object-oriented" database).

Protein mass data may be predicted from nucleic acid sequence databases.
Alternatively, protein mass data may be obtained directly from protein sequence
databases that contain a collection of amino acid sequences represented by a
string of single-letter or three-letter codes for the residues in a polypeptide,
starting at the N-terminus of the sequence. These codes may contain
nonstandard characters to indicate ambiguity at a particular site (such as "B"
indicating that the residue may be "D" (aspartic acid) or "N" (asparagine)). The
sequences typically have a unique number-letter combination associated with
them that is used internally by the database to identify the sequence, usually
referred to as the accession number for the sequence.

Databases may contain a combination of amino acid sequences, comments,
literature references, and notes on known posttranslational modifications to the
sequence. A database that contains these elements is referred to as "annotated".
Annotated databases are used if some functional or structural information is
known about the mature protein, as opposed to a sequence that is known only
from the translation of a stretch of nucleic acid sequence. Non-annotated
databases only contain the sequence, an accession number, and a descriptive
title.

The background information known about an experimental molecule by
which the database search can be constrained can include any information. Some
examples of background information include information about the species of an
experimental biological molecule, knowledge or an assumption about the mass of
the experimental biological molecule and the isoelectric point of the experimental
biological molecule.

For example, the observed molecular mass or the observed isoelectric point
of a protein can be used in combination with the measured masses of peptides
generated by proteolysis to constrain the search for a polypeptide. In particular,
the comparison between the theoretical mass data of the database proteins and
the mass data of the unknown protein may be constrained to only those proteins
of the database which are within a chosen mass range. The chosen mass range is
preferably within 50% of the mass of the unknown protein, more preferably
within 35%, most preferably within 25%. Similarly, the comparison between the
theoretical mass data of the database proteins and the mass data of the unknown
protein may be constrained to only those proteins of the database which are
within a chosen isoelectric point range. The isoelectric point (pI) of a protein is
the pH at which its net charge is zero. The chosen isoelectric point range is
preferably within 50% of the isoelectric point of the unknown protein, more
preferably within 35%, most preferably within 25%.
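
A constraint of this kind amounts to a simple filter over the database. The sketch below is illustrative only: the function name `within_fraction`, the dictionary layout (entry name mapped to a property value), and the example entries are hypothetical. The same filter applies to a molecular mass or to an isoelectric point.

```python
def within_fraction(db_values, observed, fraction=0.25):
    """Keep database entries whose value lies within +/- fraction of the observed value."""
    lo = observed * (1.0 - fraction)
    hi = observed * (1.0 + fraction)
    return {name: v for name, v in db_values.items() if lo <= v <= hi}

# Hypothetical protein masses in Da; the unknown protein is observed near 50 kDa.
db_masses = {"A": 40000.0, "B": 52000.0, "C": 70000.0}
print(within_fraction(db_masses, 50000.0, fraction=0.25))
```

Only the entries that pass the filter need to be compared with the peptide mass data, which also reduces the number of candidates that can match by chance.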

Optionally, further information about the experimental biological molecule,
such as a protein's sequence, is obtained by generating fragment mass data of the
experimental and theoretical biological molecules. Fragment mass data for a
peptide can be generated in any manner which provides fragment mass data
within a certain accuracy. Experimental conditions include the type of energy
used to generate the fragment mass data. Vibrational excitation energy can be
used. The vibrational excitation may be generated by collisions of the peptide
with electrons, photons, gas molecules or a surface. Electronic excitation can be
used. The electronic excitation may be generated by collisions of the peptide with
electrons, photons, gas molecules (e.g. argon) or a surface.

In another example, the experimental fragment mass spectrum of a peptide
from an enzymatically digested unknown protein is compared with the
theoretical masses calculated by applying the rules for the specificity of the
enzyme, and the rules for the fragmentation as known to those of ordinary skill
in the art, to the amino acid sequence of a database protein.

Fragment mass data for the purposes of this invention can be generated by
using multidimensional mass spectrometry (MS/MS), also known as tandem
mass spectrometry. A number of types of mass spectrometers can be used
including a triple-quadrupole mass spectrometer, a Fourier-transform cyclotron
resonance mass spectrometer, a tandem time-of-flight mass spectrometer, and a
quadrupole ion trap mass spectrometer. A single peptide from a protein digest is
subjected to MS/MS measurement and the observed pattern of fragment ions is
compared to the patterns of fragment ions predicted from database sequences.

In one embodiment, the invention provides a method to determine the 
probabilities for the scores that a particular molecule in a database can yield by
chance when compared with mass data. The method can operate under a variety 
of experimental and database search constraints. The score can be the number of 
matches between masses derived from known or hypothetical molecules or 
molecular constituents in a database and masses in mass data from one or 
several known or unknown molecules, or molecular constituents. The score can 
also result from a computation that utilizes the number of matches. 

In one embodiment, the invention provides a method to extract information 
about the molecules in a database. Examples of information that can be extracted 
from a database are total molecular mass, charge, isoelectric point, 
hydrophobicity and known or hypothetical chemical modification, and mass,
charge, isoelectric point, hydrophobicity and known or hypothetical chemical
modification of molecular constituents.

In one embodiment, the invention provides a method to perform actions on
molecules in the database that are intended to mimic actions occurring in an
experiment. Examples of actions are degradation of molecules into molecular
constituents by hydrolysis, where hydrolysis can result from the activity of
chemicals or enzymes. The method can also perform actions that mimic
experimental actions on molecular constituents, for example, the fragmentation
of an excited molecular constituent into smaller pieces.
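
As an illustration of mimicking an experimental action on a database molecule, the sketch below performs a complete in-silico tryptic digestion and computes monoisotopic peptide masses. It is a minimal sketch, not the implementation of the invention: it assumes the commonly used rule that trypsin cleaves after K or R unless the next residue is P, assumes complete digestion (no missed cleavages), and uses standard monoisotopic residue masses. The function names are hypothetical.

```python
# Monoisotopic residue masses in Da (standard values), plus one water per peptide.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056

def tryptic_peptides(seq):
    """Cleave after K or R, except when the next residue is P (assumed rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(seq):
        nxt = seq[i + 1] if i + 1 < len(seq) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(seq[start : i + 1])
            start = i + 1
    if start < len(seq):
        peptides.append(seq[start:])  # C-terminal peptide without K/R at its end
    return peptides

def peptide_mass(peptide):
    """Monoisotopic mass of the neutral peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER

for pep in tryptic_peptides("AKRPGKC"):
    print(pep, round(peptide_mass(pep), 4))
```

The number of peptides returned for a protein is one way to obtain the number of molecular pieces generated under the assumed digestion conditions.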

In one embodiment, the invention provides a method to derive a number of
molecular pieces, $k_u$, resulting from an action assumed to mimic an experimental
situation. The pieces can be molecular constituents, such as proteolytic peptides
resulting from enzymatic digestion of a protein, where different assumptions can
be made concerning the degree of completeness of the enzymatic digestion. The
pieces can be molecular constituents in the form of fragments of molecular
constituents, e.g. fragments of proteolytic peptides.

In one embodiment, the invention provides a method to organize the masses
of molecules or molecular constituents or fragments thereof. Examples of such
organization are given in Figs. 1 and 2, where Fig. 1 displays the number of
proteins in a database that match a given proteolytic peptide mass and Fig. 2
displays the clustered distribution of proteolytic peptide masses. A cluster of
masses of this or a similar kind will be referred to as a mass distribution
peak. Mass distribution peaks can be found for all molecules that contain a
limited number of different atoms (e.g. C, H, N, O, S).

In one embodiment, the invention provides a method for defining mass
regions wherein the frequency of various masses can be determined. The method
defines $f_i$ as the fraction of masses of molecular constituents or fragments that
falls into a mass region $i$.
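
The fractions $f_i$ can be computed by binning the constituent masses of the database into the chosen mass regions. The sketch below is an illustration under assumptions: regions are given as sorted boundaries, each region is half-open (lower edge included, upper edge excluded), and the fractions are normalized to the masses that fall inside the overall range. The function name `region_fractions` is hypothetical.

```python
from bisect import bisect_right

def region_fractions(masses, edges):
    """edges = [e0, e1, ..., eq]; region i covers [e_i, e_{i+1}). Returns f_i per region."""
    counts = [0] * (len(edges) - 1)
    for m in masses:
        i = bisect_right(edges, m) - 1   # index of the region containing m
        if 0 <= i < len(counts):
            counts[i] += 1               # masses outside all regions are ignored
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]

print(region_fractions([900.0, 1100.0, 1200.0, 2500.0], [800.0, 1000.0, 2000.0, 3000.0]))
```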



In one embodiment, the invention provides a method that determines a
probability $p_i$ that a particular molecule in a database will be found in a
randomly chosen mass distribution peak in the mass region $i$:

[equation not reproduced]

where $F$ is a function, $m_i$ is a mass region, and $c$ denotes experimental and
database search constraints.

In one embodiment $p_i$ is given by:

[equation not reproduced]

which describes the probability that a molecular constituent from a particular
molecule characterized by $k_u$ will be found in a single randomly chosen mass
distribution peak. The denominator of the expression above describing $p_i$
represents the number of mass distribution peaks within the mass region $i$.

In one embodiment the invention provides a method of determining the
probability, $p_i'$, of finding a molecular constituent originating from a particular
molecule characterized by $k_u$ within a region $\pm\Delta m$ around a randomly chosen
molecular constituent mass $m$:

[equation not reproduced]

where $\delta(m_i, \Delta m)$ denotes a function that depends on the shape of the mass
distribution peak and $m_i$ refers to a mass region. $\delta(m_i, \Delta m)$ can be interpreted as
a statistical measure of the number of molecular constituent masses that can be
found within $\pm\Delta m$ from a randomly chosen molecular constituent mass. The mass
accuracy $\Delta m$ can be different for different mass regions, i.e., in that case denoted
by $\Delta m_i$.
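
One reading that is consistent with the descriptions of $p_i$ and $p_i'$ is sketched below. It is an assumption, not the expression given by the authors: $N_i$ (the number of mass distribution peaks in region $i$) is notation introduced here, the numerator $k_u f_i$ (the expected number of constituents of the molecule in region $i$) is assumed, and the two probabilities are assumed to differ only by the factor $\delta(m_i, \Delta m)$.

```latex
p_i = \frac{k_u\, f_i}{N_i},
\qquad
p_i' = p_i\,\delta(m_i, \Delta m) = \frac{k_u\, f_i}{N_i}\,\delta(m_i, \Delta m)
```

Under this reading, a molecule with more constituents (larger $k_u$), a more densely populated region (larger $f_i$), or a poorer mass accuracy (larger $\delta$) has a higher chance of matching a measured mass at random.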

In one embodiment, the invention provides a method to determine $\delta(m_i, \Delta m)$
by simulation of the relative frequency of masses around a randomly chosen mass
in a mass distribution. In one embodiment, $\delta(m_i, \Delta m)$ is determined by
integration of a function describing molecular constituent mass distributions and
normalization to the total number of molecular constituent masses in a mass
distribution peak. In one embodiment, $\delta(m_i, \Delta m)$ is determined by direct counting
followed by normalization.
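
Direct counting followed by normalization could look like the sketch below. The details are assumptions: for every mass in one mass distribution peak, the masses within the tolerance window are counted (including the mass itself), the counts are averaged, and the average is divided by the number of masses in the peak. The function name `delta_by_counting` is hypothetical.

```python
from bisect import bisect_left, bisect_right

def delta_by_counting(peak_masses, dm):
    """Mean fraction of the peak's masses found within +/- dm of a mass from the peak."""
    ms = sorted(peak_masses)
    n = len(ms)
    total = 0
    for m in ms:
        # Number of masses in [m - dm, m + dm], the mass itself included.
        total += bisect_right(ms, m + dm) - bisect_left(ms, m - dm)
    return total / (n * n)   # mean neighbour count, normalized by the peak size

peak = [1000.0, 1000.01, 1000.5]
print(delta_by_counting(peak, 0.02))
print(delta_by_counting(peak, 1.0))
```

With a tolerance wider than the whole peak the measure reaches 1, and it shrinks as the mass accuracy improves.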

In one embodiment of the invention, a finite number of mass regions
between $m_{min}$ and $m_{max}$ is employed, each having an individually defined $p_i'$.

In one embodiment the probabilities $p_i'$ are employed to compute a total
probability, $p(k)$, for an individual molecule in the database to match randomly $k$
out of $n$ masses, where $n$ refers to the number of masses in the mass
data:

$$p(k) = G(p_i', k, n, c'),$$

where $G$ is a function and $c'$ denotes experimental and database search
constraints.

In one embodiment of the invention $p(k)$ is given by:

[equation not reproduced]

where $q$ denotes the number of mass regions, $n_1$ denotes the number of masses in
the mass data that are in mass region 1, $n_2$ denotes the number of masses in the
mass data that are in mass region 2 etc., and $k_i$, where $i = 1, 2, \ldots, q$, denotes the
number of matches in mass region $i$. The values of $k_i$ are all combinations of
values that satisfy the constraint $\sum_{i=1}^{q} k_i = k$.
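
A natural way to combine the per-region quantities into $p(k)$ is sketched below. The sum over all combinations of $k_i$ with $\sum k_i = k$ follows the definitions given here; the binomial form of each per-region factor is an assumption (each of the $n_i$ measured masses in region $i$ is treated as an independent trial that matches with probability $p_i'$), and the function names are hypothetical.

```python
from math import comb

def binomial_pmf(n, p):
    """P(k matches out of n independent trials), for k = 0..n."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

def total_match_pmf(n_per_region, p_per_region):
    """p(k) for k = 0..n, summing over all (k_1, ..., k_q) with sum k_i = k.

    Implemented as a convolution of the per-region distributions, which is
    equivalent to enumerating all combinations of k_i explicitly.
    """
    pmf = [1.0]
    for n_i, p_i in zip(n_per_region, p_per_region):
        region = binomial_pmf(n_i, p_i)
        new = [0.0] * (len(pmf) + n_i)
        for a, pa in enumerate(pmf):
            for b, pb in enumerate(region):
                new[a + b] += pa * pb
        pmf = new
    return pmf

# Two regions with one measured mass each and a 50% random match probability.
print(total_match_pmf([1, 1], [0.5, 0.5]))
```

The convolution avoids enumerating the combinations of $k_i$ one by one, which matters when the mass data contain many masses spread over many regions.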

In one embodiment of the invention, a score related to random matching is 
employed in the process of ranking molecules in a database. 

In one embodiment of the invention, the probability $p(k)$ is employed in the
process of ranking molecules in a database. A whole database or a fraction of a
database is processed and organized to allow the computation of $p(k)$ for
molecules in the database. $k$ denotes the number of matches between the masses
of molecular constituents of each database molecule investigated and masses in
the mass data. The molecules in the database can be known or hypothetical. The
molecule or molecules producing the mass data can be known or unknown.

In one embodiment of the invention, the ranking of the molecules in a
database is based on the score $S(p(k))$, where $S$ is a function.

In one embodiment of the invention

$$S(p(k)) = c \cdot \Bigl(1 - \sum_{k' < k} p(k')\Bigr) = c \cdot \sum_{k' \ge k} p(k'),$$

where $c$ is a constant or a mathematical function. When $c = 1$, $S(p(k))$ can be
interpreted as the probability that a molecule in the database would yield at least
$k$ random matches with the mass data.

In one embodiment of the invention, the molecule in the database that
yields the lowest $S(p(k))$ for $k$ matches with the mass data is given the highest
rank. The molecule in the database yielding the second lowest $S(p(k))$ for $k$
matches is given the second highest rank and so on. The identified
molecule or molecules are among the molecules having the highest ranks. The
highest ranks can be the top ranked molecule only, but can also include more
molecules than the top ranked, e.g. the top two, top three, top four, top five, top
ten, or top 100. The number of ranked molecules that are considered as
identification results can also be determined by the use of a significance test.
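
Scoring and ranking can be sketched as follows, assuming that a distribution $p(k)$ and an observed number of matches $k$ are already available for each database molecule. The score follows the form $c \cdot (1 - \sum_{k' < k} p(k'))$ with $c = 1$ by default; the function names, the input layout and the example entries are hypothetical.

```python
def score(pmf, k, c=1.0):
    """S(p(k)) = c * (1 - sum of p(k') for k' < k)."""
    return c * (1.0 - sum(pmf[:k]))

def rank(candidates):
    """candidates: {name: (pmf, k)}. Returns names ordered from highest to lowest rank.

    The lowest score (least likely to be a random match) gets the highest rank.
    """
    scored = [(score(pmf, k), name) for name, (pmf, k) in candidates.items()]
    return [name for s, name in sorted(scored)]

pmf = [0.25, 0.5, 0.25]   # illustrative p(k) for k = 0, 1, 2
candidates = {"P1": (pmf, 2), "P2": (pmf, 1), "P3": (pmf, 0)}
print(rank(candidates))
```

In practice each molecule has its own $p(k)$, because its number of constituents differs, so two molecules with the same number of matches can receive different scores.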
In one embodiment, the invention provides a method of generating a
frequency distribution of scores for a particular experimental condition, wherein 
the scores relate to random identifications of proteins. 

A frequency distribution is any compilation of the observed values of the
variable being studied and how many times each value is observed. Frequency
distributions can be in the form of a table of listings, a bar graph, a histogram, a
frequency polygon, or a continuous curve. Functions derived from frequency
distributions can be continuous (probability density functions) or discrete
(probability mass functions). Cumulative distribution functions of each type of
function can also be derived.



In one embodiment, the frequency function is generated for a sub-population
with $H$ members from a database.

In one embodiment, the sub-population is selected based upon values of $k_u$.
In one embodiment, the frequency function is generated for molecules
ranked upon their number of matches.

In one embodiment, the frequency function is $f(S)$, where $S$ is a score. In one
embodiment, $S$ is the number of random matches.
In one embodiment $S = k'$ and

[equation not reproduced]

where $p(k)$ has the meaning stated above.

Those of ordinary skill in the art will recognize that the present invention
has wide applicability for identification of molecules. Although illustrative
embodiments of the present invention have been described herein with reference
to the accompanying drawings, it is to be understood that the invention is not
limited to those precise embodiments, and that various other changes and
modifications may be effected therein by one skilled in the art without departing
from the scope or spirit of the present invention.
Claims

1. A method of assigning an identity to one or several different molecules in a
sample by comparison of characteristics obtained under certain conditions for
said sample with stored characteristics of individual ("stored") molecules, which
method is characterized by the steps of:

a) determining the number, $k$, of matches between stored characteristics
of said individual molecules and characteristics observed from the
sample;

b) computing the probability, $p(k)$, that a particular one of said stored
individual molecules has characteristics that match randomly the
characteristics of the sample;

c) assigning an individual score, $S(p(k))$, for a number of said stored
molecules based on the number of matches determined in step (a) and the
probability computed in step (b);

d) ranking each of the individual stored molecules that in step (c) have
been assigned an individual score according to this score; and

e) assigning an identity to one or several molecules, the characteristics of
which were obtained under certain conditions, based on the ranking in
step (d).

2. The method according to claim 1, further characterized in that the
determination of the number of matches in the step (a) of determining the
number of matches in claim 1 is between characteristics of stored molecules
assuming that these molecules have been subjected to the same conditions as the
molecules in the sample.



3. The method according to claim 1 or 2, further characterized in that said
characteristics are masses of the constituents of the stored molecules, which
masses cluster in mass distribution peaks, and that the step (b) of computing a
probability in claim 1 comprises the steps of:

a) determining the masses and the number, $k_u$, of masses that can be
generated for the particular condition for each individual molecule of
the stored molecules;

b) defining a total number, $q$, of regions, $i$, of the masses that have been
computed in step (a);
c) determining a fraction, $f_i$, of all the masses computed in step (a) that
are within a region $i$ as defined by step (b);

d) calculating a probability, $p_i'$ = [expression not legible in source, involving
$\delta(m_i, \Delta m)$], where the
denominator is the number of mass distribution peaks in the mass
region $i$ defined in step (b) above, and $\delta(m_i, \Delta m)$ is a statistical measure
of the number of constituent masses that can be found within $\pm\Delta m$
from a randomly chosen molecular constituent mass, so that $p_i'$ is
the probability of finding a molecular constituent originating from
a particular stored molecule within a region $\pm\Delta m$ around a randomly
chosen constituent mass;

e) determining the probabilities as described in step (d) for all regions
defined in step (b);

f) determining the number, $n_i$, of masses in the mass data that fall into
each of the $q$ mass regions $i$ defined by step (b); and

g) determining the probability

[equation not reproduced]

for a particular individual stored molecule to match randomly $k$ out of
$n$ masses, where $n$ refers to the number of masses in the
mass data.

4. The method according to any one of claims 1 to 3, characterized in that
said characteristics are masses of the constituents of the stored molecules, which
masses cluster in mass distribution peaks, and that the step (c) of assigning an
individual score in claim 1 comprises the step of calculating the score according to
$S(p(k)) = c \cdot \bigl(1 - \sum_{k' < k} p(k')\bigr)$, where $c$ is a constant or a function or operator.

5. The method according to any one of claims 1 to 4, characterized in that
said molecules are biological molecules.

6. The method according to any one of claims 3 to 5, characterized in that
said masses are obtained with mass spectrometry.
7. A method to determine a frequency function, $f(S)$, of random molecule
identification based on the method of computing the probability $p(k)$ according to
claim 1, which method is characterized by the steps of:

a) defining a sub-population, with $H$ members, of the stored molecules;
and

b) calculating the frequency function according to

[equation not reproduced], where [remainder not legible in source]

8. A method to determine a frequency function, $f(S)$, of random molecule
identification based on the method of computing the probability $p(k)$ according to
claim 3, which method is characterized by the steps of:

c) defining a sub-population, with $H$ members, of the stored molecules
where the members of the sub-population are selected based on their
values of $k_u$; and

d) calculating the frequency function according to

[equation not reproduced], where $S = k'$.